1. INTRODUCTION

Location-aware services need position information to carry out their tasks. In outdoor
environments, devices equipped with the global positioning system (GPS) can provide position information,
but GPS is hard to use in indoor environments. Recent research has shown that WiFi indoor positioning
systems (WiFi-IPSs) are very promising for such services. Many WiFi-IPS algorithms and systems have
been proposed so far, and they can be categorized as range-based or range-free algorithms [1]. WiFi
received signal strength indicator (RSSI) fingerprinting systems (range-free systems) use WiFi signals from
the surrounding wireless access points (APs) to provide an object's location information. This approach is
easy to deploy at a low cost and requires no extra infrastructure. Research has shown that WiFi fingerprinting
based on the signal strengths received from WiFi access points is a very promising method for indoor
positioning [2]. However, the method faces many difficulties, as WiFi RSSI suffers from multipath and
shadowing interference in dynamic indoor environments [3].

Therefore, the measured RSSI value is unstable and depends heavily on the measuring environment
and surrounding objects. RSSI positioning estimates also have relatively low accuracy and security [4]. In
particular, when predicting the movement trajectory of a person or device in an indoor environment, the
higher the mobility, the larger the error. Many methods have been proposed to overcome this limitation. One
example averages several selected maximum RSSI observations, using a smoothness index to test the quality
of the RSSI and to select an appropriate number of observations [5]. A multi-point fingerprint matching
algorithm has also been proposed, in which the common single-point matching procedure is expanded to
multiple points [6].

Traditional ensemble methods combine the predictions of several base estimators to improve
generalizability and robustness over a single estimator. In averaging methods, the principle is to build several
estimators independently and then average their predictions. In boosting methods, by contrast, base
estimators are built sequentially, and each one tries to reduce the bias of the combined estimator. Singh et al. [7]
propose a method that uses an ensemble of classifiers on weighted averages of WiFi RSSI values within a time
window to localize a user in an apartment. Akram et al. [8] combine Gaussian mixture model (GMM)-based
soft clustering and random decision forest (RDF) ensembles on WiFi fingerprints to solve indoor
localization for both room-level and latitude-longitude prediction. Lee et al. [9] employ the random forest
ensemble learning method on WiFi RSSI to locate a user. Some authors combine particle swarm
optimization with ensemble learning to solve indoor localization using ultra-wideband (UWB)
signals [10]. Other researchers modify traditional ensemble learning models or combine them in
different ways to find the position of a user in indoor environments based on
WiFi RSSI signals [11-16].

This research proposes a new ensemble model. In this model, an intermediate classifier selects the
best base model for each test point, and that model's prediction is used as the final output. The best model is
defined as the model that gives the smallest prediction error. This allows the new model to flexibly choose
the best estimator for each test data point.


2. RESEARCH METHOD 
2.1. WiFi fingerprint positioning 

Suppose that each location has a particular set of WiFi signal strengths from the APs, measured in the
offline phase and called an offline fingerprint. The fingerprint obtained during the online phase is then
compared with the offline fingerprints stored in the database to estimate the position of an object. In the
offline phase, each reference point includes the signal strengths measured from all accessible APs together
with its known 2D coordinates. When an object enters the region, the system compares its currently measured
RSSI data with the stored offline data to infer the position of the object [2, 17-20].

WiFi RSSI fingerprint data from the surrounding access points form a map of an area, with some
probability distribution of RSSI values at each given location (x, y). In most methods, RSSI values from the
online phase are compared with those stored in the database during the surveying phase to find the closest
match, and the position (x, y) is then predicted based on this match [21]. A fingerprint is a set of signal
strengths from the surrounding access points over time at a given location (called a reference point). These
fingerprints are related to the locations associated with them, so they can be used to distinguish those
locations. When applying machine learning models to this kind of problem, most methods perform two
separate phases. In the first phase, the training phase, multiple WiFi RSSI fingerprints scanned at each
reference point, together with its coordinates, are used to train the learning model, and they are also recorded
in the database for future use. In the second phase, the online phase, a newly scanned RSSI fingerprint at an
unknown location is forwarded to the trained model to predict the unknown position (x, y). The most
commonly used estimation method is K-nearest neighbors (KNN). More complicated methods, including
support vector machines (SVM), deep neural networks (DNN), hidden Markov models (HMM), and
Gaussian-process-assisted methods, have also been implemented [2, 22-25].

Now, let us formalize the method using a mathematical model. Assume that the signal strengths at
N points in a room are measured together with their coordinates (the points may lie on a square grid or be
placed at random). Typically, 80% of the data set is used for training and the remaining 20% is used for
testing. Each point $P_i$ in Figure 1 has the coordinates $(x_i, y_i)$ and a set of RSSI values
from the access points: $P_i(x_i, y_i, RSSI_i)$, where $RSSI_i = (r_i^1, r_i^2, \ldots, r_i^M)$ are the RSSI values from
M access points and $r_i^j$ (j = 1..M) is the RSSI of the j-th access point at point i. The Euclidean distance
in signal space from point $P_i$ to any point $P_k$ is calculated by the formula:

$$d(P_i, P_k) = \sqrt{\sum_{j=1}^{M} \left(r_i^j - r_k^j\right)^2} \qquad (1)$$


The coordinates of point $P_i$ are predicted based on its K nearest points. More complicated algorithms,
such as deep neural networks (DNN), support vector regression (SVR), and other modern learning models, can
be applied to give better accuracy, but at the cost of more intensive computation.

Figure 1. Actual experimental environment and the setup for training data points
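
As a minimal sketch of KNN fingerprint matching, the following pure-Python example computes the signal-space Euclidean distance between fingerprints and averages the coordinates of the K nearest reference points. The function names (`signal_distance`, `knn_locate`), the unweighted average, and the toy fingerprint values are illustrative assumptions and not the authors' implementation.

```python
import math


def signal_distance(rssi_a, rssi_b):
    """Euclidean distance between two RSSI vectors of equal length M."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(rssi_a, rssi_b)))


def knn_locate(fingerprints, query_rssi, k=3):
    """Estimate (x, y) as the mean position of the k nearest reference points.

    fingerprints: list of ((x, y), rssi_vector) pairs from the offline phase.
    query_rssi:   RSSI vector scanned in the online phase.
    """
    ranked = sorted(fingerprints, key=lambda fp: signal_distance(fp[1], query_rssi))
    nearest = ranked[:k]
    x = sum(pos[0] for pos, _ in nearest) / len(nearest)
    y = sum(pos[1] for pos, _ in nearest) / len(nearest)
    return x, y


# Hypothetical offline database: four reference points, three access points (dBm).
database = [
    ((0.0, 0.0), [-40, -70, -80]),
    ((0.0, 2.0), [-50, -60, -80]),
    ((2.0, 0.0), [-55, -75, -65]),
    ((2.0, 2.0), [-65, -62, -60]),
]

# The two closest fingerprints are the first two entries, so the estimate
# is the midpoint of (0, 0) and (0, 2).
estimate = knn_locate(database, [-42, -69, -79], k=2)  # -> (0.0, 1.0)
```

A distance-weighted average (closer fingerprints count more) is a common variant and would only change the two averaging lines.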


2.2. Proposed ensemble method 

After training the base models (KNN, DNN, and random forest (RF)) and using them to predict the
coordinates of the test points, we observed that the prediction errors of a single point differ from model to
model. This means that each model is the best choice only for a subset of the test points. Therefore, we
build an intermediate classifier that decides which base model (KNN, DNN, or RF) best fits a given
point. Our proposed ensemble model is illustrated in Figure 2.


[Figure 2 diagram. Recoverable box labels: Base Models predict Test; Test prediction from SVC; Final Prediction]


Figure 2. The proposed new ensemble model 


Specifically, we divide the RSSI dataset measured in the offline phase into three data sets:
training, validation, and test. The training set is used to train the three base models, which are then used to
predict the validation set. For each validation point, the base model with the smallest prediction error is
taken as its label (an integer that represents the base model's index in the base model list), and the labeled
validation set is used to train a support vector classifier (SVC) as the intermediate classifier. Each point in
the test set is forwarded to the trained SVC model to select an appropriate base model, which makes the final
prediction for that test point. In short, our new ensemble works with the following steps:

Step 1: New ensemble model initialization
  - Initialize the base regression machine learning models
  - Initialize the SVC intermediate classification model

Step 2: Training the new ensemble model
  - Use the training data set to train the base regression models
  - Make predictions on the validation data set with the trained base models and evaluate the errors
  - Label each data point in the validation set with the base model that gives the smallest error for
    that validation point
  - Train the intermediate SVC model on the validation data set together with the base model labels

Step 3: New ensemble prediction
  - Forward each test data input to the SVC model to get the best base model from the intermediate classifier
  - Use the selected base model to make the prediction and get the final predicted coordinates for each
    test data point
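
The three steps can be sketched with scikit-learn as follows. This is an illustrative reading of the procedure and not the authors' code: the class name `SelectorEnsemble`, the fallback when a single base model wins every validation point, the smaller model sizes, and the synthetic log-distance RSSI data are all assumptions made so the example runs quickly on its own.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVC


class SelectorEnsemble:
    """Per-point model selection: an SVC picks which base regressor to use."""

    def __init__(self, base_models):
        # Step 1: initialize the base regressors and the intermediate classifier.
        self.base_models = list(base_models)
        self.selector = SVC(random_state=0)
        self.constant_label_ = None

    def fit(self, X_train, y_train, X_val, y_val):
        # Step 2: train every base regressor on the training set.
        for model in self.base_models:
            model.fit(X_train, y_train)
        # Euclidean error of each base model on each validation point.
        errors = np.column_stack([
            np.linalg.norm(model.predict(X_val) - y_val, axis=1)
            for model in self.base_models
        ])
        # Label each validation point with the index of its best base model.
        labels = errors.argmin(axis=1)
        if np.unique(labels).size == 1:
            # Assumption: SVC needs two classes, so fall back to the single winner.
            self.constant_label_ = int(labels[0])
        else:
            self.selector.fit(X_val, labels)
        return self

    def predict(self, X):
        # Step 3: get a base model index per point, then use that model's output.
        if self.constant_label_ is not None:
            chosen = np.full(len(X), self.constant_label_)
        else:
            chosen = self.selector.predict(X)
        all_preds = np.stack([m.predict(X) for m in self.base_models])
        return all_preds[chosen, np.arange(len(X))]


# Synthetic stand-in for measured fingerprints (assumption: log-distance path loss).
rng = np.random.default_rng(0)
aps = rng.uniform(0, 10, size=(9, 2))  # hypothetical access point positions


def simulate_rssi(xy):
    d = np.linalg.norm(xy[:, None, :] - aps[None, :, :], axis=2)
    return -40 - 20 * np.log10(d + 0.1) + rng.normal(0, 2, size=d.shape)


xy = rng.uniform(0, 10, size=(600, 2))
rssi = simulate_rssi(xy)
X_train, y_train = rssi[:400], xy[:400]
X_val, y_val = rssi[400:500], xy[400:500]
X_test, y_test = rssi[500:], xy[500:]

ensemble = SelectorEnsemble([
    KNeighborsRegressor(n_neighbors=7, weights="distance"),                   # index 0
    MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=300, random_state=0),  # index 1
    RandomForestRegressor(n_estimators=50, random_state=0),                   # index 2
]).fit(X_train, y_train, X_val, y_val)

pred = ensemble.predict(X_test)
mean_error = np.linalg.norm(pred - y_test, axis=1).mean()
```

Because the selector only chooses among existing predictions, every row of `pred` is exactly the output of one base model; the ensemble can never be better than the per-point best base model, and its quality depends on how well the SVC learns the labels.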


2.3. Validation data labeling

Table 1 shows some sample validation data points that are forwarded to the base models. For each
point, the index of the base model with the minimum error is used as its label, and the labeled validation
data are used to train the intermediate classifier (SVC).


Table 1. Data labeling for intermediate classifier
Validation     Model Prediction Error (m)        Label (index of the model
Data Point     KNN (0)    DNN (1)    RF (2)      giving minimum error)
1              1.99       2.97       2.43        0
2              1.00       0.47       1.78        1
3              0.53       1.08       1.12        0
4              2.36       4.90       1.35        2
5              2.89       1.70       2.13        1
6              1.43       0.46       0.50        1
7              0.84       0.75       0.74        2
8              1.19       1.58       0.82        2
9              2.86       4.30       3.16        0
10             2.10       1.15       1.85        1
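
The labeling rule can be checked directly against Table 1. The short sketch below takes the per-model errors from the table and labels each point with the index of the smallest error; the helper name `label_best_model` and the tie-breaking rule (lowest index wins) are assumptions, since the table contains no ties.

```python
# Per-point prediction errors (m) from Table 1, ordered as (KNN, DNN, RF).
validation_errors = [
    (1.99, 2.97, 2.43),
    (1.00, 0.47, 1.78),
    (0.53, 1.08, 1.12),
    (2.36, 4.90, 1.35),
    (2.89, 1.70, 2.13),
    (1.43, 0.46, 0.50),
    (0.84, 0.75, 0.74),
    (1.19, 1.58, 0.82),
    (2.86, 4.30, 3.16),
    (2.10, 1.15, 1.85),
]


def label_best_model(errors):
    """Return the index of the smallest error; ties go to the lowest index."""
    return min(range(len(errors)), key=errors.__getitem__)


labels = [label_best_model(row) for row in validation_errors]
# -> [0, 1, 0, 2, 1, 1, 2, 2, 0, 1], matching the Label column of Table 1
```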


3. RESULTS AND DISCUSSION 

The experiments were conducted in our lab room, which measures 10 by 10 meters. The room
is divided into a grid of 9 by 9 points at which the WiFi RSSI fingerprints of the surrounding access points
are measured in the surveying phase. The room contains tables, chairs, computers, and other networking
devices, as well as people working in the room. In this specific scenario, we use RSSI values from 9 access
points surrounding the room and nearby rooms (there are about 600 measured points). The validation dataset
is extracted from the measured training points, and those points are excluded from the training dataset. For
the testing data, we additionally measure RSSI fingerprints at random points in the room. To evaluate the
performance of our proposed algorithm, the Euclidean distance between the estimated and true locations is
used to measure the error. Let $(x_i, y_i)$ be the true 2-D physical coordinates and $(\hat{x}_i, \hat{y}_i)$ be the
estimated location of point $P_i$. The distance error $Err_d$ is then computed as:

$$Err_d = \sqrt{(x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2} \qquad (2)$$
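
As a small worked example of this error metric, the sketch below computes the per-point Euclidean distance error and its mean over a set of points. The function names and the sample coordinates are hypothetical and chosen so the arithmetic is easy to follow.

```python
import math


def distance_error(true_xy, est_xy):
    """Euclidean distance (m) between the true and estimated 2-D positions."""
    return math.hypot(true_xy[0] - est_xy[0], true_xy[1] - est_xy[1])


def mean_distance_error(true_points, est_points):
    """Average distance error over paired true and estimated positions."""
    errors = [distance_error(t, e) for t, e in zip(true_points, est_points)]
    return sum(errors) / len(errors)


# Hypothetical coordinates in meters.
true_points = [(1.0, 1.0), (4.0, 5.0)]
est_points = [(1.0, 2.0), (1.0, 1.0)]

# Per-point errors are 1.0 and 5.0 (a 3-4-5 triangle), so the mean is 3.0.
mean_err = mean_distance_error(true_points, est_points)  # -> 3.0
```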

For the experiments, we use the Python scikit-learn (sklearn) library with three base models,
configured with the following hyperparameters: RandomForestRegressor(random_state=0, n_estimators=300),
KNeighborsRegressor(n_neighbors=7, weights='distance'), and MLPRegressor(hidden_layer_sizes=
(128, 64, 32, 16), activation='relu', solver='adam', batch_size=3, max_iter=300, random_state=0). For the
intermediate classifier, we use SVC(random_state=0). The VotingRegressor model uses the same three base
models as our proposed ensemble model.
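
For reference, the stated configuration can be written out as scikit-learn objects. The hyperparameters are the ones listed above; the estimator names (`"knn"`, `"dnn"`, `"rf"`) and the `MultiOutputRegressor` wrapper are assumptions, since the text does not say how the two coordinates were handled for VotingRegressor.

```python
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVC

# Base models in the index order used for labeling: KNN (0), DNN (1), RF (2).
base_models = [
    ("knn", KNeighborsRegressor(n_neighbors=7, weights="distance")),
    ("dnn", MLPRegressor(hidden_layer_sizes=(128, 64, 32, 16), activation="relu",
                         solver="adam", batch_size=3, max_iter=300, random_state=0)),
    ("rf", RandomForestRegressor(random_state=0, n_estimators=300)),
]

# Intermediate classifier that selects a base model index per point.
selector = SVC(random_state=0)

# Assumption: VotingRegressor fits one target at a time, so it is wrapped here
# to predict the x and y coordinates with one voting model per coordinate.
voting_xy = MultiOutputRegressor(VotingRegressor(estimators=base_models))
```

Constructing the estimators does not train anything; each object still needs a call to `fit` on the fingerprint data.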

Figure 3 shows the mean error of each base model and of the new ensemble model (the data are also
shown in Table 2). The new ensemble model achieves a mean error of 1.10 m, lower than that of each base
model (1.30 m for DNN, 1.50 m for RandomForest, and 1.65 m for KNN). Figure 4 presents a numerical
analysis of the base model selection (the data are also shown in Table 3). For each base model, the percentage
is the rate at which that model gives the smallest error among the three base models on the test data. The
percentage for the new ensemble model is the rate at which the intermediate classification model (SVC)
correctly selects the base model that gives the smallest error. In this specific scenario, it achieves a rate of
60.38%, higher than that of each base model.
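
The two kinds of percentages described here can be computed as in the following sketch. The function names and the four sample test points are hypothetical; only the definitions (rate of giving the smallest error, and rate of the classifier selecting that model) come from the text.

```python
def best_index(errors):
    """Index of the smallest error; ties go to the lowest index (assumption)."""
    return min(range(len(errors)), key=errors.__getitem__)


def best_model_rates(errors_per_point, n_models):
    """Percentage of points on which each base model has the smallest error."""
    wins = [0] * n_models
    for errors in errors_per_point:
        wins[best_index(errors)] += 1
    return [100.0 * w / len(errors_per_point) for w in wins]


def selector_hit_rate(errors_per_point, chosen):
    """Percentage of points where the selected index is the true best model."""
    hits = sum(1 for errors, c in zip(errors_per_point, chosen)
               if c == best_index(errors))
    return 100.0 * hits / len(errors_per_point)


# Hypothetical per-point errors (m) for three base models and SVC selections.
test_errors = [(1.0, 2.0, 3.0), (2.0, 1.0, 3.0), (3.0, 2.0, 1.0), (0.5, 0.9, 0.7)]
svc_choice = [0, 1, 1, 0]

rates = best_model_rates(test_errors, 3)          # -> [50.0, 25.0, 25.0]
hit = selector_hit_rate(test_errors, svc_choice)  # -> 75.0
```

Without ties, the base model rates sum to 100%, whereas the selector's hit rate is measured on a separate scale: it is the classification accuracy of the intermediate model.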


Table 2. Mean error of base models and our proposed ensemble model
Model           Mean Error (m)
New Ensemble    1.10
DNN             1.30
KNN             1.65
RandomForest    1.50

Figure 3. Mean error of base models and our proposed ensemble model (x-axis: Models)

Table 3. The rate (%) at which a model correctly gives the minimum error
Model           Ratio (%)
New Ensemble    60.38
DNN             41.51
KNN             26.42
RandomForest    32.08

Figure 4. The rate (%) at which a model correctly gives the minimum error (x-axis: Models)

The cumulative error distribution of each base model and of the new ensemble model is illustrated in
Figure 5. We also compared our proposed ensemble model with other ensemble learning models such as
VotingRegressor and ExtraTreeRegressor; the comparison is presented in Figure 6 (the data are also
shown in Table 4). On this specific WiFi RSSI dataset, our proposed ensemble model achieves a mean error
of 1.10 m, compared with 1.42 m for VotingRegressor and 1.46 m for ExtraTreeRegressor. From the
experiments, we also found that the validation dataset should be large enough to train the intermediate
classifier thoroughly, so that the proposed ensemble model gives good predictions on the testing
data.
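
A cumulative error distribution of this kind can be computed as the percentage of test points whose error does not exceed a given threshold. The sketch below is one simple way to do this; the function name, the thresholds, and the sample errors are hypothetical.

```python
def cumulative_error_distribution(errors, thresholds):
    """Percentage of errors that are <= each threshold (empirical CDF)."""
    n = len(errors)
    return [100.0 * sum(e <= t for e in errors) / n for t in thresholds]


# Hypothetical distance errors (m) of one model on five test points.
errors_m = [0.4, 0.9, 1.1, 1.6, 2.5]

cdf = cumulative_error_distribution(errors_m, [0.5, 1.0, 2.0, 3.0])
# -> [20.0, 40.0, 80.0, 100.0]
```

Plotting one such curve per model against the thresholds shows which model reaches a given percentage of points at the smallest error.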


Figure 5. Error distribution of models (y-axis: Cumulative Error Distribution (%))
Figure 6. Mean error of ensemble models

Table 4. Mean error of ensemble models
Ensemble Model        Mean Error (m)
New Ensemble          1.10
VotingRegressor       1.42
ExtraTreeRegressor    1.46


4. CONCLUSION


This paper presented and analyzed our proposed new ensemble learning model. Unlike traditional
ensemble models, our proposed model trains an intermediate classification model on a validation data set,
and the trained classifier is then used to select the best base model for each test data point. The experiments
on our WiFi RSSI dataset show that our model has better accuracy than both the traditional ensemble models
and the base models.